Photonic machine learning with on-chip diffractive optics

Machine learning technologies have been extensively applied in high-performance information-processing fields. However, the computation rate of existing hardware is severely circumscribed by conventional Von Neumann architecture. Photonic approaches have demonstrated extraordinary potential for executing deep learning processes that involve complex calculations. In this work, an on-chip diffractive optical neural network (DONN) based on a silicon-on-insulator platform is proposed to perform machine learning tasks with high integration and low power consumption characteristics. To validate the proposed DONN, we fabricated 1-hidden-layer and 3-hidden-layer on-chip DONNs with footprints of 0.15 mm2 and 0.3 mm2 and experimentally verified their performance on the classification task of the Iris plants dataset, yielding accuracies of 86.7% and 90%, respectively. Furthermore, a 3-hidden-layer on-chip DONN is fabricated to classify the Modified National Institute of Standards and Technology handwritten digit images. The proposed passive on-chip DONN provides a potential solution for accelerating future artificial intelligence hardware with enhanced performance.


Supplementary Note 1.2: Forward and error backward propagation (FEBP)
The forward propagation process can be expressed by Eq. where represents a neuron of the -th layer, is the output function of the -th neuron, is the function of electromagnetic propagation, refers to a neuron of the next layer that is connected to neuron by optical diffraction, is the input wave to the -th neuron of layer , is the transmission coefficient of the -th neuron of layer , and are the amplitude and phase factors, here ≈ 1 [1] . The error backward propagation is realized by the gradient descent algorithm. By assuming that the on-chip DONN consists of layers (excluding the input and output layer), then the loss function ( ) can be defined as Eq. (S-2): ( ) where ( ) is the loss function, +1 is the detecting intensity of the output areas, +1 is the target intensity of the output areas, refers to the number of measurement points at the output detected area. Therefore, the problem of the optimization for an on-chip DONN can be summarized as min ( ) , 0 < ≤ 2 . The gradient calculation formula of the loss function is depicted in Eq. (S-3): ( ) ( ) 11   where, 3 ≤ ≤ − 1. So far, the parameters of an on-chip DONN can be obtained through continuous optimization in the training process by the gradient descent algorithm. The forward and error backward propagation can also refer to previous works [1,2] .

Supplementary Note 1.3: Neuron-mapping process
Based on the OEP model and FEBP process, all parameters of on-chip DONNs can be obtained. Subsequently, it is significantly important to accurately map the pretrained parameters onto physical structures. In particular, the neuron-mapping process is the most critical and challenging. In this study, silicon-based slots are used to realize the pretrained phase values of neurons. According to the effective medium theory (EMT) and FDTD simulation, the effective refractive index (ERI) of the silicon slot filled with silicon dioxide (SSSD) under different widths can be calculated [1,3] . Here, the thickness of the SSSD is 220 nm, the lattice constant (the center distance between two adjacent SSSDs) of the SSSD is fixed as 500 nm, and the width of the SSSD is set as 200 nm by considering the practical fabrication conditions. Therefore, the ERI of the SSSD is calculated as 2.166 by Eq. (2), which is listed here again, as shown in Eq. (S-5).
where − represents the length of the SSSD in the -th group, represents the ERI of the slot group filled with silicon dioxide through which light passes, represents the ERI of the slab waveguide, 0 = 2 ⁄ is the wavenumber of light propagating in vacuum, ∆ represents the phase delay generated by the -th slot group filled with silicon dioxide. According to Eq. (S-5), the length of the SSSD can be calculated by the pretrained phase value ∆ . Thus, the in Eq. (S-5) should be a fixed value during the pre-training process. However, there is mutual interference between adjacent SSSD [1] . When the pretrained neuron values after being mapped onto the physical structures, the realized by the physical structure unit is different from that in the pre-training stage and varies with the change of SSSD length, which seriously affects the neuron-mapping process. To reduce the impacts of mutual interference between adjacent SSSDs during the neuron-mapping process, a slot group filled with silicon dioxide composed of multiple identical SSSDs is used to approximate a phase value of a neuron. The calculated by the phase delay generated by the slot group with a different number of SSSDs is shown in Fig. S2a. It is evident that when the number of orange-yellow SSSD increases, the ERI (the red triangle) calculated by the phase delay generated by the slot groups tends to be a stable value. In addition, the distance between the input light and the hidden layer (HL) will also affect the accuracy of the neuron-mapping process. As shown in Fig. S2b, when the incident light propagates to the HL for 30 µm, the calculation result of the ERI of the identical SSSD by FDTD is different. When the distance between the input light and the HL increases to 250 μm, the value of the ERI tends to be stable. When the distance further increases to 450 μm, the overall change of the ERI of the SSSD is smaller. Therefore, by a comprehensive consideration, the distance between the input light and the first HL and the distance between adjacent HLs are set to 250 µm in this study. The relevant contents can also refer to the previous work [1] .

Supplementary Figure 2. Calculation results of effective refractive index (ERI) based on FDTD. a Calculation
of the ERI under a different number of identical silicon slots filled with silicon dioxide in a slot group. The incident light source is flat light, and the center distance of the adjacent silicon slot filled with silicon dioxide (SSSD) is 500 nm, the width of the SSSD is 200 nm, the thickness is 220 nm, and the length is 1.964 μm. The length of 1.964 μm is randomly generated. b The incident light source is a point light, and the distance between the light source and the HL is 30 µm, 250 µm, and 450 µm, respectively. Among them, the HL consists of 100 identical SSSDs, numbered from 1 to 100. The dots are the ERI values calculated by the phase delay ∆ generated by the light passing through the SSSD of the corresponding sequence number. 0 is the ERI in the pre-training process, the period of the SSSD is 500 nm, the width of the SSSD is 200 nm, the thickness is 220 nm, and the length is 2 μm.   Table 1 indicates the prediction accuracy of the on-chip DONNs for classification task on the Iris plants dataset, it is evident that with the increase in the number of HLs ( means the number of HLs in the on-chip DONN), the prediction accuracy has not been significantly improved. Therefore, in the process of network design, it is necessary to comprehensively consider the performance of the network, the energy efficiency, and the integration of the chip.    Supplementary Table 2 indicates the prediction accuracy of the on-chip DONNs for classification task of the MNIST handwritten digit images. Meanwhile, the letter 'n' represents the number of HLs in the on-chip DONN (e.g., the DONN-M3 indicates the DONN includes three HLs).

Supplementary Note 3: Numerical analysis of phase errors
Specific phase values of neurons can be pretrained and obtained in advance, and then the physical connection of these phase values on hardware can be realized through physical structures. However, inevitable errors would be caused in the fabrication process. Eq. (S-6) is the transfer matrix of the on-chip DONNs. Here, we assume that the error factor is = ( 1 , 2 ,⋅⋅⋅, ), whose values are ranging from 0 to 1. The phase value ℎ including machining errors is the product of the original phase value ℎ and the error factors, as shown in Eq. (S-8). (S-7) and (S-9) are the diagonal matrixes before and after the fabrication errors are introduced, respectively.
where is the input signals, is the output results, ( = 1,2, … , + 1) is the diffractive connection matrix, ℎ is the pretrained parameters, is the error factor generated by the fabrication, ℎ is the product of ℎ and , and are diagonal matrices.

Supplementary Note 4: On-chip DONN-system: fabrication and measurement Supplementary Note 4.1: Device fabrication of on-chip DONN-I1 and DONN-I3
To experimentally evaluate the performance of the proposed on-chip DONNs, on-chip DONN-I1 and DONN-I3 were fabricated. For example, Fig. S6 shows the micrograph of Chip-3, which contains the on-chip DONN-I3 with three HLs. The input signals are loaded by two cascaded 1 × 2 Multimode interferometers (MMI) and four phase shifters. In this design, the thickness and width of the single-mode waveguide are 220 nm and 450 nm, and the footprint of the on-chip DONN-I1 and DONN-I3 are about 0.3 × 0.5 mm 2 and 0.3 × 1.0 mm 2 , respectively. Fig. S7 is a picture of the chip after packaging.
Supplementary Figure 6. Micrograph of Chip-3 and related devices. a Micrograph of Chip-3. The signal is loaded onto the waveguide through the input grating coupler (IGC); then, they are divided into four channels by two cascaded 1 × 2 MMIs. The input phase information is loaded by a thermo-optical phase shifter on each waveguide. After the signal is loaded onto the phase of light, it passes through three HLs (metasurfaces) and enters the output waveguide at the output interface. Finally, it is captured by the optical power meter via the output grating couplers (OGC). AS is the power attenuation structure, which is to prevent the negative impacts of light reflection in the nondetection area of the output interface. b Close-up of the HL, in which a silicon slot group consists of three identical slots filled with silicon dioxide to map the phase values onto physical structures.

.2: Experimental demonstration of on-chip DONN-I1 and DONN-I3
For the Iris plants dataset, each sample has four features, namely calyx length, calyx width, petal length, and petal width. The category of Iris flower species can be identified through these features. In this experiment, the signal is loaded via thermo-optical modulators. The relationship between the voltage and phase curve is shown in Fig. S8a. The experimental testing flow chart for predicting the classification tasks is shown in Fig. S8b. Fig. S8c shows the experimental devices. In addition, due to the fabrication errors, the testing accuracies of the Iris plants dataset of on-chip DONN-I1 and DONN-I3 are 56.7% and 60%, respectively, without introducing the algorithm compensation in experiments; When the algorithm compensation is introduced into the experiments, the testing accuracies of the on-chip DONN-I1 is improved to 86.7%, and DONN-I3 is improved to 90%. Fig. S9 shows the experimental testing results before and after introducing the algorithm compensation.
Supplementary Figure 9. Experimental test results of on-chip DONN-I1 and DONN-I3. a, b Experimental testing results of the on-chip DONN-I1, in which the red dot represents " etosa," the blue dot represents "Versicolor," and the purple dot represents "Virginia." In addition, the correct prediction result of serial numbers 1-10 should be " etosa," in other words, the power value corresponding to the red dot should be the ma imum. The correct prediction result of serial numbers 11-20 should be "Versicolor," which means that the power value corresponding to the blue dot should be the largest. The correct prediction result of serial numbers 21-30 should be "Virginia," at this time, the power value corresponding to the purple dot should be the largest. For example, in Fig. S9a, the power value of the red dot for the No. 5 test sample is the largest, so the prediction is correct; while the power of the blue dot for the No. 15 testing sample should be the largest, but the power of the red dot is the largest at this time, so the prediction is incorrect. c, d Testing results of the on-chip DONN-I3, and the data analysis is the same as a and b. The intensity is normalized according to the specific sample; where, the unit of the normalized power 'a.u.' is the abbreviation of 'arbitrary unit'.

Supplementary Note 4.3: Device fabrication of on-chip DONN-M3
The general fabrication process of the on-chip DONN-M3 is the same as that of on-chip DONN-I1 and DONN-I3. The difference is that the materials of heating electrodes and metal wires are different. The details are introduced in the "Methods" section of the main text. The on-chip DONN-M3 is designed and fabricated for the MNIST classification task. The input 28 × 28 grayscale image is reshaped into a 784 × 1 vector and compressed by the input layer into ten features, thus the fabricated chip has 10 input gratings and 10 output gratings. The input gratings are used for signal loading of 10 features after compression, and the 10 output gratings are used for recognition of handwritten digits (0-9). The microscope and scanning electron microscopy (SEM) images of the on-chip DONN-M3 are shown in Fig. S10a and S10b, respectively. The picture after wiring and packaging of the on-chip DONN-M3 is shown in Fig. S10c, and the experimentally obtained phase modulation curve as a function of the applied voltage is shown in Fig. S10d.

Supplementary Note 5: Compensation analysis of on-chip DONN errors Supplementary Note 5.1: System error compensation for the Iris flower classifier
Here is a specific example for the compensation analysis of on-chip DONN errors, Fig. S11a is a 1-hidden-layer DONN. Fig. S11b is the logic diagram of Fig. S11a, and Fig. S11c is the equivalent logic diagram after introducing the experimental compensation algorithm. The mathematical calculation process of Fig. S11a is expressed by Eq. (S-10). When the inevitable errors are brought by the signal loading and fabrication process, the mathematical calculation process is expressed by Eq. (S-11). To reduce the impacts of signal loading and machining errors on the performance of on-chip DONNs, two steps are adopted to reduce the negative influences caused by the signal loading and machining errors. First, the phase compensation is introduced in the signal loading stage, meanwhile, the compensated phases of ( 11 , 22 , 33 , 44 ) are generated by the voltages of (∆ 1 , ∆ 2 , ∆ 3 , ∆ 4 ) according to Fig.  S8a. Then, in the signal detection stage, the output power of different ports is multiplied by a power compensation factor, which refers to Fig. S11c for the specific process. Among them, the phase compensation process in the first stage is depicted by Eq. (S-12). The power detection process is depicted by Eq. (S-13), and the power compensation stage is expressed by Eq. (S-14).
Supplementary Figure 11. On-chip DONN calculation process and its error compensation process. a On-chip DONN with one hidden layer. b Logic diagram of Fig. S11a. c Equivalent logic diagram of Fig. S11b after introducing the experimental compensation algorithm.  (2) (1) For the phase compensation stage, a set of optimized fixed voltage values(∆ 1 , ∆ 2 , ∆ 3 , ∆ 4 ) can be found through the corresponding algorithm. When this set of fixed voltage values is found, phase compensation can be realized by adding the original signal voltage value to this set of voltage values; For the power compensation stage, based on the phase compensation result, a set of optimized fixed coefficients ( 1 , 2 , 3 ) can be found through the optimization search algorithm. Finally, by multiplying the output power of all samples by the set of fixed power coefficients, the whole error compensation stage can be completed. P1, P2 and P3 represent the optical power detected by different output ports respectively.

Supplementary Note 5.2: System error compensation for the Handwritten digit classifier
The dimension of the MNIST dataset is relatively high, thus the compensation method for online in-situ training would take a long time even if 100 test sets are selected. However, based on the practical existing experimental conditions, the stability of the experimental system cannot be guaranteed for a long time, due to the optical path would be affected by polarization, ambient temperature, and vibration during the signal loading stage. Thus, the online in-situ training compensation would be difficult. Therefore, we chose the second part of the compensation method introduced in Supplementary Note 5.1 as shown in Eq. (S- 14) for getting the error compensation factors offline. The MNIST classification task is more complex than the iris classification task, the system errors are more difficult to be corrected, thus a 10 × 10 full connection layer ( ) after the DONN-M3 chip is adopted to compensate the system errors, and the mathematical expression of the compensation process is shown in Eq. (S-15). The full connection layer can be obtained through machine learning. When the training is completed, the full connection layer is fixed in the process of the blind test sets classification.

Supplementary Note 6: Algorithm compensation
To compensate for the fabrication errors, an algorithm compensation method, including phase compensation and power compensation, is used to reduce the negative impacts of the introduced errors. Here, the input signals were applied to the phases of light through the input voltages, and the phase modulation curve is shown in Fig. S8a. : Average loss for each particle; : The total number of samples which were classified correctly; : Accuracy after power compensation; : The best performance for the particles through training; : The best performance for the whole particle group through training; : The average loss of configurations; () Sort  : Picking the best configurations from the samples; () Find  : Finding the corresponding voltage compensation configurations with respect to the power compensations.
() Generate  : Generating the in the next iteration for particles.

Supplementary Note 6.2: Pseudo code of the optimization process
Error compensation is an optimization problem, which aims to minimize the impact of errors on the performance of the system. The optimization problem is described mathematically as Eq. (S-16).
To find the minimum value of the optimization object function above, two steps are required: Step 1. Compensations on voltages by fixing =1 to make the loss function which is described as Eq. (S-17) to be converged, the compensation in this stage is called phase compensation, where ⊙ indicates multiplication with the corresponding element.
In this compensation, the optimized best voltage compensation configurations in each iteration will become the candidates for the power compensation in the next step.
Step 2. Compensation on power by fixing the best voltage compensation configurations in each iteration of the previous, = (∆ 1 , ∆ 2 , ∆ 3 , ∆ 4 ), and then finding a set of optimal power compensation factors = ( 1 , 2 , 3 ) by the traversal search method to further increase the accuracy of classification.
Before the algorithm starts, several determinations are required as: velocities), should be determined. Other parameters related to the compensation algorithm should also be set. Then, this algorithm will randomly generate the group of compensation voltage configurations particle by particle. Afterwards, the algorithm will run into iterations to find the candidate which can obtain the lowest value of the loss function. As can be vividly seen from Fig. S12, before the beginning of the -th iteration, the compensation voltage configurations for each particle are generated based on the best particle performance and other statistical features extracted from all particles in the ( -1)-th iteration. The footnote represents the particle's index. If is lower than the maximum iteration number or the loss value of ( -1)-th iteration is larger than the minimum loss value, the algorithm will operate into the -th iteration. Otherwise, the whole iteration will expire. In the i-th iteration, performances of different particles' compensation voltage configurations on different samples from the training dataset are needed to be recorded. For each compensation voltage configuration of one particle, PC will add the into the original voltage of each sample from the training dataset. Then, these modified voltages will be sent onto the phase-shifter via the I/O module and the driver (Fig. S8). The intensities of the output channels without power compensation are transmitted back to the central processing unit (CPU) to calculate the loss values for each sample. If voltages of all samples are sent, then the algorithm runs into the evaluation of particles' performances. For each particle, the average loss can be computed by averaging the single loss value for each sample. fter all particles' performances are evaluated, the best performance can be obtained from the best performance of each particle for the lowest loss. The corresponding value is called the group best. The iteration will be executed and all results and voltage configurations will be recorded in each iteration until either iteration's inde exceeds the maximum value or the loss value is lower than the targeted value. The recorded group best configurations in each iteration will participate in the power compensation stage. In the power compensation stage, power compensation factors are adopted via the traversal method to further increase the classification accuracy. By pre-determining both maximum and minimum power compensation factor's boundary with searching steps, one can record the accuracies with respect to different power compensation factors and voltage configurations from each iteration's in the phase compensation stage, and the best power compensation factor can be found according to the maximum accuracy. In the end, this algorithm compensation method can find the best and for the minimum value of the target function and achieve the best prediction accuracy. Additionally, other algorithms can also be considered for experimental error compensation, such as the compensation algorithms used in the previous works [4,5] .

Supplementary Note 7: On-chip DONN computational speed and power consumption Supplementary Note 7.1: Computation speed
By assuming that an on-chip DONN system has nodes and a 100 GHz data detection rate, the system includes layers, while each layer contains an × weighting matrix, the number of floating-point operations per second (FLOPS) can be evaluated by the Eq. (S-18) [6] . Taking the on-chip DONN-I3 we have already fabricated as an example, here = 2, = 186, therefore, the computing speed of the DONN-I3 can reach 1.38 × 10 16 FLOPS. where 1 is the distance from the narrow waveguide to the slab waveguide; 0 is the vacuum light speed, 1 is the ERI of the narrow waveguide; is the total number of the hidden layers; is the length of the longest slot in the hidden layers; 2 is the ERI of the silicon slots filled with silicon dioxide; 2 is the distance from the start of the slab waveguide to the first hidden layer; is the propagation distance between the hidden layers; 3 is the propagation distance between the last hidden layer and the output layer; 3 is the ERI of the slab waveguide. As an example, for our designed Chip-3, 1 = 2.33, 2 = 2.166, 3

Supplementary Note 7.2: Power consumption
First, the input power of the laser under 1.55 μm is 32 mW. The input signal is loaded by the thermo-optic phase shifters, and the average energy required to set each phase shifter a 2π rad is approximately 30 mW. In addition to the light source and the signal loading stage, the calculation process of the computing part in our proposed on-chip DONN is fully passive, so the system does not need to consume extra energy in processing the inference tasks. Therefore, in terms of theoretical calculation, the energy consumed to complete one calculation for the proposed on-chip DONN-I3 system is about 1.1 × 10 −17 /FLOP. Moreover, to reduce the power consumption of the phase shifters, one could use doped side heaters [8,9] or liquid-crystal-based phase shifters [10] . In addition, for the on-chip DONN system, the power consumption of the metasystem is the summation of the power required for propagation and the optical power required to support an optical nonlinearity that could be potential implementations of future devices. Here, the transmission loss of light in the working process of on-chip DONN system is ignored, and power consumption is mainly required to support an optical nonlinearity. Therefore, assume the saturation power for our saturable absorber is MW/cm 2 ( ≈ 1, . g. graph n ), and an area of a neuron = 1. × 2.0 , then the total power needed to run the system is estimated to be = × × = 3 ( ), meanwhile, is the neuron number per layer.
For the proposed on-chip DONN-I3 system ( = 186), the input power of the laser is approximately 32 mW and the power required to maintain the operation of the phase shifter is × 30 mW, the operations per second is = 1.38 × 10 16 FLOPS, thus the power consumption is approximately .1 × 10 −17 /FLOP.